(57) Abstract

The method of this invention identifies distinctive items of information from a larger body of information on the basis of
similarities or dissimilarities among the items and achieves a significant increase in speed as well as the ability to balance the
representativeness and diversity among the identified items by applying selection criteria to randomly chosen subsamples of all the
information. The method is illustrated with reference to the compound selection requirements of medicinal chemists. Compound
selection methods currently available to chemists are based on maximum or minimum dissimilarity selection or on hierarchical
clustering. The method of the invention is more general and incorporates maximum and minimum dissimilarity-based selection as
special cases. In addition, the number of iterations required to select the items is a multiple of the group size which, at its
greatest, is approximately the square root of the population size. Thus the selection method runs much faster than the methods of
the prior art. Further, by adjusting the subsample size parameter K, it is possible to control the balance between
representativeness and diversity in the compounds selected. In addition, the method can mimic the distributional properties of
selections based on hierarchical clustering and, at least in some cases, improve upon them.

WO 98/46998

AN OPTIMAL DISSIMILARITY METHOD FOR CHOOSING DISTINCTIVE
ITEMS OF INFORMATION FROM A LARGE BODY OF INFORMATION

BACKGROUND OF THE INVENTION
A portion of the disclosure of this patent document contains material which is
subject to copyright protection. The copyright owner has no objection to the facsimile
reproduction by anyone of the patent document or the patent disclosure, as it appears in the
U.S. Patent and Trademark Office, WO, or any national patent office patent
records, but otherwise reserves all copyright rights whatsoever.

Field of the Invention:

A method is presented which identifies distinctive items of information from a larger
body of information on the basis of similarities or dissimilarities among the items. More
specifically, the method achieves a significant increase in speed as well as the ability to
balance the representativeness and diversity among the identified items by applying
selection criteria to randomly chosen subsamples of all the information.

Description of Background Art:

Most disciplines from economics to chemistry are benefiting from the ability of the
modern computer to store and retrieve vast amounts of data. While the chore of actually
handling the data has been reduced, the ability to gather [...] enormous amount of data
has, in itself, created new problems. In particular, in some cases the amount of data has
become so vast that it is difficult to comprehend the range of data, much less to
understand and derive meaningful information from it. To address this problem, attempts
have been made in the prior art to find ways to meaningfully group or abstract some
structure from the information.

For instance, in many situations it is useful to use a stratified sampling procedure
for getting aggregate information about a population within which different groups cost
different amounts to survey, or whose responses have different meanings to the questioner,
etc. However, while the stratified sampling approach is very efficient, to use it one must be
able to quickly get information about each cluster within the population, and be able to
select representative people to poll. The method of this invention permits the rapid
evaluation of the demographic profiles in such a situation to see how many people in a
random sample are closest to each selected [...]. Small but [...] groups [...] surveyed [...].

The approach of this invention could also aid in the design of clinical trials. It often
happens that drugs tested in a small random sample fail later because of severe adverse
reactions in a small subpopulation. Using the method of this invention, a selection based on
medical and family history could produce a more useful early phase test population. The
following discussion of the method of this invention is presented in terms useful to
medicinal chemists who are concerned with identifying subpopulations of chemical
compounds. However, as can be seen from the brief examples above, the method of the
invention is general and can equally well be applied to other fields by those skilled in the
art. The generality of the method is readily appreciated if, for the term "compound" used in
this disclosure, the term "data element" is substituted. Brief examples of other specific
applications of the methodology of the invention are set out at the end of the disclosure.

The advent of combinatorial chemistry and high through-put screening has made the
ability to identify "good" subsets in large libraries of compounds very important, whether
the libraries in question have actually been synthesized or exist as assemblies of virtual
molecular representations in a computer. One kind of good subset is one which contains
members which represent the chemical diversity inherent in the entire library while at the
same time not containing more members than are necessary to sample the inherent
diversity. This desire for a good subset is driven by the fact that these libraries can contain
an enormous number of compounds which, if every single compound were tested, would be
extremely costly in terms of money, time, and resources to individually evaluate.

Essentially, what is desired is a subset of a size which is reasonable to test. Clearly, good
subsets can be generated based upon other criteria as well, such as ease of synthesis or
lowest cost. Traditionally such subsets have been created by expert systems - i.e., having a
medicinal or pesticide chemist select compounds manually based on a series of 2D structures.
This approach is labor-intensive and is dependent on the expert used. Moreover, it is
neither routinely practical nor very good for more than 300-1000 compounds, and then only
when the library in question includes one or more homologous series. In the prior art,
currently available alternative approaches for selection include maximum dissimilarity
selection, minimum dissimilarity selection, and hierarchical clustering, among others. Each
of the available methods can be effective, but each has some intrinsic limitations.

Maximum Dissimilarity: The methods currently most often used for selecting
compounds focus on maximizing the diversity of the selected subset with respect to the set
as a whole using a descriptor (metric) which characterizes the members of the set and an
associated (dis)similarity measure. The basic approach is straightforward, and utilizes as
parameters a minimum acceptable dissimilarity (redundancy) threshold R and a maximum
selected subset size M. The approach is essentially as follows:

1. Select a compound at random from the dataset of interest, add it to the
   selection subset, and create a pool of candidate compounds out of the
   remainder of the dataset.

2. Examine the pool of candidates, and, using the characterizing measure
   (metric), identify the candidate which is most dissimilar to those which have
   already been selected.

3. Determine whether the dissimilarity of the most dissimilar candidate is less
   than R (redundancy test). If it is less than R, stop. If it is not less than R,
   add that candidate to the selection set and remove it from the pool of
   candidates.

4. If the compound to be selected in this step is the third compound being
   selected, after its selection return the first two selections to the pool of
   candidate compounds. (The first selection was chosen randomly, and the
   second selection is strongly biased by the first. Transferring them back into
   the candidate pool reduces the effect of the initial random selection.)

5. If the desired subset size M has been reached, stop.

6. If there are no more candidates in the pool, stop. If there are more
   candidates in the pool, go back to step 2.
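
The six steps above can be sketched in Python as follows. This is an illustrative sketch only, not code from the disclosure: the function names are hypothetical, and the set-wise rule used in step 2 (the distance to the nearest already-selected item) is an assumption, since the text does not fix how dissimilarity between one compound and a set is defined.

```python
import random


def set_dissimilarity(item, selected, dist):
    # Assumed set-wise rule: distance to the nearest already-selected item.
    return min(dist(item, s) for s in selected)


def max_dissimilarity_select(data, dist, R, M, seed=0):
    """Sketch of prior-art maximum dissimilarity selection (steps 1-6)."""
    rng = random.Random(seed)
    pool = list(data)
    selected = [pool.pop(rng.randrange(len(pool)))]        # step 1
    seeds_returned = False
    while pool and len(selected) < M:                      # steps 5 and 6
        # Step 2: candidate most dissimilar to the current selection.
        best = max(pool, key=lambda c: set_dissimilarity(c, selected, dist))
        # Step 3: redundancy test against the threshold R.
        if set_dissimilarity(best, selected, dist) < R:
            break
        pool.remove(best)
        selected.append(best)
        # Step 4: after the third selection, return the first two to the pool.
        if len(selected) == 3 and not seeds_returned:
            pool.extend(selected[:2])
            selected = selected[2:]
            seeds_returned = True
    return selected
```

For example, `max_dissimilarity_select([0, 1, 2, 10, 11, 20], lambda a, b: abs(a - b), R=5, M=10)` returns one value from each of the three well-separated groups, whichever compound is drawn first.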

A related method developed by Agrafiotis works by comparing evolving subsets to
maximize the diversity across the entire set. Maximum dissimilarity methods are
biased towards inclusion of outliers; i.e., those candidates most dissimilar from the group
as a whole. In some situations, this is a very useful property, but medicinal chemists tend to
avoid outliers in making their own selections because they may not "look like" drugs. In
some cases, outliers in corporate databases are outliers for good reason - difficulty of
synthesis or toxicity, for example - which reduces their value as potential leads. Moreover,
a maximally diverse subset may not be adequately representative of the biochemical
diversity in a dataset.

One justification in drug research for maximizing diversity is based on experimental
design considerations commonly employed for analyzing quantitative structure/activity
relationships (QSARs), where outliers are important because they have the greatest weight.
The libraries from which subsets are to be selected are usually much more diverse and
much larger in size than those used for QSAR, however. In such a situation, outliers lose
their statistical leverage because it may no longer be possible to use the QSAR approach to
adequately approximate biochemical responses as linear or quadratic functions of the
descriptors (metrics) being used.

Minimum Dissimilarity: Recently, Robert Pearlman et al. introduced an alternative
approach to compound selection (the "elimination method" in DiverseSolutions) which can
be characterized as minimum dissimilarity selection. The approach uses the same two
parameters as maximum dissimilarity selection - a minimum dissimilarity threshold R and
M, the maximum number of compounds to select - but applies them differently as
follows:

1. Select a compound at random from the dataset of interest, add it to the
   selection set, and create a pool of candidate compounds out of the remainder
   of the dataset.

2. Examine the pool of candidates, and, using the characterizing measure
   (metric), remove any for which the dissimilarity to the most recent selection
   is less than R.

3. If there are no more candidates in the pool, stop.

4. Select a compound at random from the pool of candidates.

5. If the desired subset size M has been reached, stop.

6. Go back to step 2.
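
A minimal Python sketch of the elimination steps above is given here for illustration; it is not taken from the disclosure, and the function name and `seed` parameter are hypothetical. The size test of step 5 is checked at the top of the loop, which is equivalent to the listed order whenever M is at least 2.

```python
import random


def min_dissimilarity_select(data, dist, R, M, seed=0):
    """Sketch of prior-art minimum dissimilarity (elimination) selection."""
    rng = random.Random(seed)
    pool = list(data)
    selected = [pool.pop(rng.randrange(len(pool)))]          # step 1
    while len(selected) < M:                                 # step 5
        latest = selected[-1]
        # Step 2: drop every candidate closer than R to the latest selection.
        pool = [c for c in pool if dist(c, latest) >= R]
        if not pool:                                         # step 3
            break
        selected.append(pool.pop(rng.randrange(len(pool))))  # step 4
    return selected                                          # step 6 is the loop
```

Because each pass permanently removes the neighbors of the latest selection, only a pairwise distance function is ever needed, which is the limitation discussed in the following paragraph.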

Notice that in each iteration, the similarity test (step 2) is applied only with respect
to the most recent compound selected at that iteration. This limits the method to pairwise
measures of dissimilarity; it cannot be used with measures between single compounds and
a set of compounds. As Holliday and Willett have pointed out, the definition of set-wise
dissimilarity can radically affect the outcome of a dissimilarity-based selection method, just
as the linkage method used can affect the outcome of hierarchical clustering.

Minimum dissimilarity selection tends to be order dependent; i.e., the members
included will be dependent on the initial random selections. This can be alleviated by
setting M very high, so that the method runs to exhaustion; i.e., runs out of candidates
before M is reached [stops at step 3]. For most datasets, doing so [...]
justified value for R (e.g., 0.15 - 0.2 for Tanimoto dissimilarity [...]) will
[...] it is necessary to repeat the
minimum dissimilarity selection several times on the same dataset in order to find a value
of R which will yield the desired number of selections when minimum dissimilarity is run
to exhaustion. If a reasonable value of M and a neighborhood radius R are used,
the minimum dissimilarity method will return a subset similar to one chosen by random
selection, where the only difference is that there will be no redundant selections, as defined
by R. Such a selection will be representative but may not be diverse enough to satisfy some
chemists.

Hierarchical Clustering: In agglomerative hierarchical clustering, the most similar
pair of clusters are consolidated at each level, starting from one singleton cluster for each
compound in the set. Selecting from clusters obtained using Ward's or complete-linkage
methods returns subsets which are both representative and diverse, in that each compound
is represented by some neighbor and the representatives are distributed across the full
breadth of the dataset. Medicinal chemists generally find selections based on hierarchical
clustering intuitively appealing and natural, especially after they have eliminated "oddball"
clusters and singleton compounds from the selection list. Indeed, they sometimes make
their own selections by manually clustering structures. By examining the dissimilarity
between the most similar clusters at each step, one can identify the most "natural" level of
clustering near the number of compounds one wishes to select - i.e., levels at which the
last clusters consolidated were substantially more similar than any remaining clusters are to
each other.

Hierarchical clustering is not always a practical option, however. The speed of the
classical technique decreases with the cube of the size N of the dataset, and so becomes
slow for large libraries. In addition, computer memory requirements generally restrict
direct applications to relatively small datasets (≤2000 compounds). Faster approaches,
including reciprocal nearest neighbors (RNN), are available which relieve the memory
limitations and can usually reduce scaling problems dramatically. However, even in the
best case, the speed is inversely proportional to N². Unfortunately, the scaling benefits can
only be fully realized when centroids are well-defined and well-behaved in the metric space
being explored, which is not the case for some important metrics of interest - in particular,
for Tanimoto similarities between 2D fingerprints.

Definitions:

N shall mean the entire set of elements contained in a dataset from which it is
desired to select a subset.
M shall mean the size of the subset of elements which it is desired to select.
R shall mean an exclusion criterion chosen by the user. One example of such a
criterion is a minimum dissimilarity threshold between the data elements.
K shall mean a subsample size chosen by the user.

SUMMARY OF THE INVENTION

The method of this invention addresses the problem of how to select from a very
large dataset of information a smaller subset for detailed analysis and/or evaluation which
reflects the full range of information contained in this large dataset. The present method
improves on the methods available in the prior art in two significant ways.
First, it is possible to adjust the balance for a data subset between how representative the
selected subset is and how diverse it is. Second, the method achieves the desired selection
much faster than any comparable method. In particular, while the number of fundamental
steps to obtain a subset of a given size in prior art methods are multiples of the size of the
dataset N or powers of N, it has been discovered that, with the method of this invention,
the number of steps required is a multiple of M which, at its greatest, is approximately the
square root of N, √N.

In prior art selection methods large datasets are examined as a whole. In the method
of this invention a subsample K of the entire dataset size N is defined and examined. The
method of this invention is more general than those of the prior art, and, in fact, reduces as
limiting cases to the minimum and maximum dissimilarity methods of the prior art for
values of K = 1 and K = N. The tradeoff between how representative or how diverse the
selected subset is may be varied by the user by selecting intermediate values of K.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 diagrammatically shows the compounds chosen at each of the first three
iterations of a selection performed by the method of this invention (R = 6 mm (original
scale), [...]). Symbols representing compounds which have already been
selected are filled in solid. The circles around each solid symbol show the minimum
dissimilarity radius R; dark hash symbols indicate compounds examined at each iteration
which are excluded as too similar to those which have already been selected. Stippled
symbols indicate candidates in the recycling bin, i.e., which have already been considered
for selection once.

Figure 2 shows the design of the combinatorial dataset.

Figure 3 shows the distribution of scalar descriptors for the combinatorial dataset.
(A) Data for all 1000 compounds is displayed in three dimensions. (B) Distribution for 150
compounds in the XY plane.

Figure 4 shows the relationships among compound classes in the combinatorial
dataset as shown by hierarchical clustering on UNITY 2D fingerprints.

Figure 5 shows the divergence of OptiSim selections from uniform sampling across
pharmacological classes for the Mannhold-Rekker dataset as a function of subsample size
K. The χ² statistic is scaled by division by its 5 degrees of freedom.

Figure 6A shows the relationship between the average dissimilarity ρ between
unselected compounds and sets of 21 compounds selected by OptiSim as a function of the
average dissimilarity δ among selected compounds. One of two sets of 100 compounds
randomly selected from the combinatorial dataset was used to calculate ρ for each trial. The
data shown is for UNITY 2D fingerprints from the combinatorial dataset. (•) K = 1; (o) K
= 2; (■) K = 5; (□) K = 10; (*) K = 15; (◇) K = 25; (▲) K = 35; (▽) K = 1000;
(line) spline curve drawn through the mean values of ρ and δ for each value of K.

Figure 6B shows the relationship between the average dissimilarity ρ between
unselected compounds and sets of 21 compounds selected by hierarchical clustering as a
function of the average dissimilarity δ among selected compounds. (■, □) selections made
from hierarchical complete linkage clusters; (▲, ▽) selections made from hierarchical
group average clusters; (□, ▽) one compound was randomly selected from each cluster;
(■, ▲) one compound was randomly selected from among the three most central
compounds in each cluster; (line) spline curve corresponding to that shown for OptiSim
selections in Figure 6A.

Figure 7 is a diagrammatic flow sheet of the invention.

Figure 8 shows the dendrograms produced by using single linkage, group average,
and complete linkage hierarchical clustering to select 30 compounds.

Figure 9 shows dendrograms analogous to those of Figure 8 for OptiSim selections
made with K = 1, 3, and 5. The complete linkage dendrogram from Figure 8 is repeated
for comparison.

Figure 10 shows a schematic dendrogram for an OptiSim clustering from a selection
set of 10 compounds.

Figure 11 is a plot of the distribution of cluster sizes [...] data set
for single, group, and complete linkage. The bar codings are the same as shown in Figure
8.

Figures 12 and 13 show the comparison of the cluster size distributions obtained
from OptiSim clusterings based on K = 1, 2, 3, and 5 to the complete linkage clustering.
The bar codings are the same as shown in Figure 8.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Holliday and Willett have pointed out that their own and other dissimilarity-based
selection methods were actually specific manifestations of a more general, unified method
- and in particular, they addressed the question of what similarity meant when comparing one
compound to a group of compounds. Lance and Williams have done much the same
analysis for hierarchical clustering. In a similar way, maximum and minimum dissimilarity
selection can be reformulated as limiting cases of a single, more general method, the
"OptiSim" method of this invention. The generalization entails introduction of a parameter
K which defines a subsample size at each iteration. The method can be outlined as follows:

1. Select a compound at random from the dataset of interest, add it to the
   selection set, and create a pool of candidate compounds out of the remainder
   of the dataset. Create an empty recycling bin.

2. Randomly select a subsample set of K compounds (or all remaining) from the
   candidate pool. If necessary, transfer compounds from the recycling bin to
   the candidate pool and randomly select a sufficient number of compounds (if
   possible) to complete a subsample of size K.

3. Determine whether any of the compounds in the subsample have a
   dissimilarity less than R with respect to those compounds already selected. (In
   the case of the first pass, this will only be the initially randomly selected
   compound.) Remove any such compounds from the subsample and replace
   them with other compounds from the candidate pool.

4. If the candidate pool is exhausted, remove all compounds from the recycling
   bin and put them back into the candidate pool.

5. Repeat step 3 until the subsample includes K compounds at least R dissimilar
   to those already selected or until the candidate pool is exhausted.

6. If the subsample is empty, stop.

7. If the subsample is not empty, examine the subsample, and identify a
   compound maximally dissimilar to those already selected.

8. Add the compound identified in step 7 to the selection set and remove it
   from the subsample.

9. Put those compounds in the subsample which were not selected into the
   recycling bin.

10. If the desired selected subset size M has been reached, stop. If the desired
    selected subset size M has not been reached, return to step 2.

A preferred alternative formulation of the method can be set out as follows:

1. Select a compound at random from the dataset of interest, add it to the
   selection set, and create a pool of candidate compounds out of the remainder
   of the dataset. Create an empty recycling bin.

2. Randomly choose another compound from the dataset and determine if it has
   a dissimilarity less than R with respect to those compounds already selected
   for the selection set. If it does not have a dissimilarity less than R, place it in
   the subsample.

3. Repeat step 2 until the subsample includes K compounds at least R dissimilar
   to those already selected or until the candidate pool is exhausted.

4. If the candidate pool is exhausted but the recycling bin is not exhausted,
   remove all compounds from the recycling bin and put them back into the
   candidate pool.

5. If the candidate pool is not exhausted and there are fewer than K compounds
   in the subsample, go to step 2.

6. If the subsample is empty, stop.

7. If the subsample is not empty, examine the subsample and identify a
   compound maximally dissimilar to those already selected.

8. Add the compound identified in step 7 to the selection set and remove it
   from the subsample.

9. Put those compounds in the subsample, which were not added to the
   selection set, into the recycling bin.

10. If the desired selected subset size M has been reached, stop. If the desired
    selected subset size M has not been reached, return to step 2.
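
The alternative formulation above can be sketched in Python as follows. This is an illustrative reading of the ten steps, not the SPL code of Appendix "A": the function and variable names are hypothetical, compounds that fail the R test are assumed to be discarded for good, and the set-wise dissimilarity is assumed to be the distance to the nearest already-selected compound.

```python
import random


def optisim_select(data, dist, R, M, K, seed=0):
    """Sketch of OptiSim selection with subsample size K (steps 1-10)."""
    rng = random.Random(seed)
    pool = list(data)
    rng.shuffle(pool)                  # popping from a shuffled list = random draw
    selected = [pool.pop()]            # step 1
    recycle = []                       # step 1: empty recycling bin

    def score(c):
        # Assumed set-wise rule: distance to the nearest selected compound.
        return min(dist(c, s) for s in selected)

    while len(selected) < M:           # step 10
        subsample = []
        while len(subsample) < K:      # steps 2, 3 and 5
            if not pool:
                if not recycle:
                    break              # nothing left to draw from
                pool, recycle = recycle, []   # step 4: empty the recycling bin
                rng.shuffle(pool)
            cand = pool.pop()
            if score(cand) >= R:       # step 2: keep only non-redundant draws
                subsample.append(cand)
        if not subsample:              # step 6
            break
        best = max(subsample, key=score)      # step 7
        subsample.remove(best)
        selected.append(best)          # step 8
        recycle.extend(subsample)      # step 9
    return selected
```

Under these assumptions, calling the function with `K=1` accepts the first non-redundant random draw, while `K=len(data)` examines every remaining non-redundant compound before choosing the most dissimilar one; intermediate values of K fall between those two behaviors.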

Figure 7 shows the method of this invention set out in flowsheet form.

The results of the application of [...]
iterations are illustrated in Figure 1 [...]
compounds.

Clearly, the maximum and minimum dissimilarity selection methods can be seen
simply as extreme instances of this Optimal Dissimilarity Selection approach - OptiSim,
for short. For maximum dissimilarity, K is effectively N, the number of elements in the
dataset: all compounds are considered as candidates at each step. The only substantive
difference from the prior art maximum dissimilarity method is that the first two selections are
not dropped (step 4 in the maximum dissimilarity approach), so the selections obtained are
slightly more representative and less diverse than would be the case for the original
method.

For minimum dissimilarity, K is simply 1. In addition, the similarity test (step 3) is
applied to each candidate with respect to all elements which have already been selected.
This method broadens the range of dissimilarity measures which can be used over the
version of the minimum dissimilarity method described above.

An important aspect of the present invention is that, by choosing an intermediate
value of K, it is possible to strike a balance along the continuum between the diversity of
the maximum dissimilarity approach and the representativeness of the minimum
dissimilarity approach. As shown below, it turns out that one can mimic selection based on
hierarchical clustering as well. Thus, the method of this invention is more general than
prior art approaches and, when desired, reduces to or mimics the prior art approaches.

Experimental Methodology: Combinatorial dataset design.

Generally, all calculations and analyses to conduct combinatorial chemistry
screening library design and follow up are implemented in a modern computational
chemistry environment using software designed to handle molecular structures and
associated properties and operations. For purposes of this Application, such an environment
is specifically referenced. In particular, the computational environment and capabilities of
the SYBYL and UNITY software programs developed and marketed by Tripos, Inc. (St.
Louis, Missouri) are specifically utilized. Unless otherwise noted, all software references
and commands in the following text are references to functionalities contained in the SYBYL
and UNITY software programs. Software with similar functionalities to SYBYL and UNITY
is available from other sources, both commercial and non-commercial, well known to
those in the art. A general purpose digital computer with ample
memory and hard disk storage is [...]
performing the methods of this invention [...] thousands of
molecular structures as well as other data may need to be stored simultaneously in the
random access memory of the computer or in rapidly available permanent storage. The
inventors use a Silicon Graphics, Inc. Challenge-M computer having a single 150 MHz
R4400 processor with 128 Mb memory and 4 Gb hard disk storage space. SPL (SYBYL
Programming Language) code to implement a preferred method of this invention is set
forth in Appendix "A".

The Legion combinatorial builder module in SYBYL was used to create a
homologous set of libraries, each with a pyrimidine, a cyclohexane or a pyridine at the
core (Figure 2) and an analogous pattern of substitution around each ring. The pyrimidine
and cyclohexane libraries consisted of 2244 compounds each, whereas the pyridine library
was made up of 6000 compounds. A composite library was built from the 6000-compound
pyridine dataset, 500 randomly selected cyclohexanes and 100 randomly selected
pyrimidines. The final dataset of 1000 compounds - 892 pyridines, 92 cyclohexanes and 16
pyrimidines - was created by randomly selecting from among the 6600 compounds in the
composite library. The Tanimoto dissimilarity T* (equivalent to the Soergel distance)
was used to assess the dissimilarity between any two compounds a and b in the dataset:

$$T^{*} = 1 - \frac{\text{No. of bits occurring in both molecules}}{\text{No. of bits in either molecule}}$$

The Tanimoto fingerprint similarity simply expresses the degree to which the substructures
found in both compounds are a large fraction of the total substructures. Standard UNITY 2D
fingerprints were used for evaluating dissimilarities.
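
As a hedged illustration of the formula above, the following Python sketch computes the Tanimoto dissimilarity for two fingerprints represented as sets of "on" bit positions. The set representation and the function name are assumptions made for the example; UNITY fingerprints themselves are not reproduced here.

```python
def tanimoto_dissimilarity(bits_a, bits_b):
    """1 - (bits set in both molecules) / (bits set in either molecule)."""
    either = bits_a | bits_b
    if not either:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(bits_a & bits_b) / len(either)


# Two hypothetical fingerprints sharing 2 of 6 distinct bits.
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(tanimoto_dissimilarity(a, b))
```

Identical fingerprints give 0 and fingerprints with no bits in common give 1, so a threshold such as R is always interpreted on the 0 to 1 scale.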

Three scalar descriptors were generated for the combinatorial dataset by drawing
three numbers for each compound from a uniform random population of reals between 0
and 1, then squaring each number and multiplying it by 99. Adding 1 to the integer part of
each value produced three descriptors for each compound with values between 0 and 100.
These were distributed independently of the 2D structure and, hence, of the corresponding
fingerprints. The distribution of values in the resulting scalar three-descriptor space forms a
gradient of density running out from the concentration of points near the origin (Figure 3).
Dissimilarity was evaluated in terms of Euclidean distance when using these descriptors to
evaluate the performance of OptiSim on multiple scalar descriptors.
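
The descriptor recipe above can be sketched in Python as follows. The function names and the seed are hypothetical; note that Python's `random.random()` draws from the half-open interval [0, 1), so this sketch yields integers from 1 to 99.

```python
import math
import random


def scalar_descriptors(n_compounds, seed=0):
    """Three descriptors per compound: 1 + integer part of 99 * u**2."""
    rng = random.Random(seed)
    return [
        tuple(int(99 * rng.random() ** 2) + 1 for _ in range(3))
        for _ in range(n_compounds)
    ]


def euclidean(a, b):
    # Euclidean distance between two descriptor triples.
    return math.dist(a, b)


descriptors = scalar_descriptors(1000)
print(descriptors[0], euclidean(descriptors[0], descriptors[1]))
```

Squaring the uniform draw pushes most values toward the low end of the range, which is what produces the density gradient away from the origin described in the text.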

Comparisons: Subsets generated by Optimal Dissimilarity Selection at various
subsampling levels K were characterized non-parametrically in terms of the χ² statistic:

$$\chi^2 = \sum_{i=1}^{c} \frac{(O_i - E_i)^2}{E_i} \qquad (1)$$

where O_i denotes the observed count obtained from the i-th of c clusters and E_i denotes the
count expected from that cluster (the term "cluster" is used here because the comparisons
of most immediate interest are between cluster-based selection and OptiSim results;
"category" or "class" would be equally appropriate).

The χ² statistic with respect to a random sample is a measure of how representative
a particular selection set is. In general, the expected count for random sampling is given
by:

$$E_i(\text{random}) = \frac{b\,M\,n_i}{N} \qquad (2)$$

where b is the number of trials in each block; M is the number selected per trial; n_i is the
number of compounds in the cluster; and N is the total number of entities being selected
from. For random sampling, one expects to select most often from the most populous
cluster, and to select proportionately less often from smaller clusters. The larger a selection
set's value of χ² (random), the more it diverges from being representative of the dataset as
a whole.

Note that the OptiSim method of this invention explicitly precludes re-selecting any
compound, whereas the random selection distribution is for sampling with replacement. As
a result, the selections are not strictly independent of each other and (2) is not exact. This
is not a problem if the number of selections per trial does not greatly exceed the number of
clusters. It is then necessary, however, to block trials if one is to keep the expected number
of selections for each cluster large enough to avoid having to make a continuity correction
to the χ² statistic calculated in (1).

If selection is perfectly uniform across clusters, each cluster will be equally likely to
be sampled. How uniformly a selection is distributed across c clusters, then, is a measure
of how similar a result is to cluster-based selection. The χ² (uniform) statistic is therefore
calculated from:

$$E_i(\text{uniform}) = \frac{b\,M}{c}$$

Again, perfect concordance gives a χ² of 0. The smaller χ² (uniform) is, the better the
result mimics cluster-based selection.

Scaling: As noted above, a perfectly random selection will have a χ² (random) of 0,
and a perfectly uniform selection will have a χ² (uniform) of 0. Both results, however, are
quite unlikely when selections are made independently. For either statistic, the mean χ²
expected by chance is equal to the degrees of freedom (df = c - 1). Scaling by this
population mean makes it easier to compare experiments which involve different numbers
of clusters (i.e., categories of classification), since it makes the expected result equal to 1
no matter how many clusters are involved.
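
The scaling described above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: it takes statistic (1) to be the usual Pearson
chi-squared, sum of (observed - expected)² / expected, and takes the expected count under
random sampling to be proportional to cluster population. The function names are
hypothetical and are not part of the macro in the Appendix.

```python
from typing import List, Sequence


def scaled_chi2(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Pearson chi-squared of observed vs expected counts, divided by df = c - 1."""
    if len(observed) != len(expected) or len(observed) < 2:
        raise ValueError("need matching count vectors covering at least two clusters")
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2 / (len(observed) - 1)


def expected_random(cluster_sizes: Sequence[int], total_selections: float) -> List[float]:
    """Expected counts if each selection falls in a cluster in proportion to its size."""
    n = sum(cluster_sizes)
    return [total_selections * size / n for size in cluster_sizes]


def expected_uniform(cluster_sizes: Sequence[int], total_selections: float) -> List[float]:
    """Expected counts if every cluster is equally likely to be sampled."""
    c = len(cluster_sizes)
    return [total_selections / c] * c


# Cluster populations quoted for the scalar descriptors of the combinatorial dataset.
sizes = [283, 167, 134, 130, 90, 78, 45, 29, 24, 20]
# Hypothetical block of 100 selections spread perfectly evenly over the ten clusters.
observed = [10] * 10

print(scaled_chi2(observed, expected_uniform(sizes, 100)))  # 0.0: perfectly uniform
print(scaled_chi2(observed, expected_random(sizes, 100)))   # far above 1: not representative
```

A perfectly even block scores 0 against the uniform reference and well above the chance
expectation of 1 against the random reference, which is the behaviour the two statistics
are meant to separate.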
Experimental Results - Power of OptiSim Methodology:
Scalar descriptors: A hierarchical clustering was done in Selector using complete
linkage on the three scalar descriptors generated for the 1000 compounds in the
combinatorial dataset, which resolved the dataset into ten clusters. Two clusters split one
corner of the cubical descriptor space; the rest cover the remaining corners and the center.
The ten clusters obtained were made up of 283, 167, 134, 130, 90, 78, 45, 29, 24 and 20
compounds, respectively. Not surprisingly, the largest cluster is near the origin and the
smallest is at the opposite vertex of the cubical boundaries of the descriptor space.

Optimal Dissimilarity Selection was then applied to the dataset 20 times at each of
seven different subsampling rates K, selecting ten compounds (M = 10) each time; R
was set to 10. The same random number was used to "seed" the method for one trial at
each subsampling rate, so any biases due to the initial selection were spread equally across
all values of K. The number of selections from each cluster was summed across a block of
ten such trials, so that each block included 100 selections for each subsample size. The
statistics obtained were then averaged across two such blocks. The total number of scalar
selections was therefore:

    (10/trial) x (10 trials/block) x (2 blocks) x (7 subsampling rates) = 1400

in 140 trials, with a total of 200 selections made for each subsampling rate.

The distributions of these selections across clusters were then compared to the totals
expected for random sampling and to the totals expected for uniform selection from the ten
clusters. Table 1 shows the values obtained as averages across two blocks for each value
of K.
Table 1. Divergence of subsets selected from the combinatorial dataset from random and
uniform distributions across clusters. Values cited are in terms of scaled chi-squared
(χ² / degrees of freedom).

             scalars (df = 9)                  fingerprints (df = 20)
K (a)        random (b)       uniform          random      uniform
1            1.23±0.43 (c)    7.21±0.14         0.89       15.05
5            2.06±0.50        3.56±0.51         4.04        7.36
10           2.58±0.11        2.44±0.11        11.15        5.56
15           5.04±0.66        1.98±0.24        13.75        3.94
25           5.88±1.16        1.93±0.38        20.32        3.38
35           7.61±2.89        2.02±0.07        26.74        2.08
1000 (d)     7.59±0.72        0.67±0.15        49.84        2.05

(a) Subsampling rate at each iteration.
(b) Reference distribution.
(c) Mean ± SEM for two blocks.
(d) All unselected compounds considered at each iteration.



The χ² (random) for minimal dissimilarity selection (K = 1) is not significantly
different from that expected by chance (1.23 vs 1.0), but it rises steadily with increasing K
as the selected subsets grow more diverse and less representative. The χ² (uniform) profile, on
the other hand, falls quickly with increasing K to a plateau value of about 2 for K = 15-35.
The maximal dissimilarity extreme (K = 1000) produces a significantly more uniform
distribution (0.67 vs ~2.0), which reflects the fact that a cluster is, in this case, located at
each of the first nine linear D-optimal points of the descriptor space - the corners plus the
center.

Note: maximum dissimilarity selection (K = 1000) performs unusually well here with
respect to hierarchical clustering in part because the dimensionality (3) is much smaller than
M (10), and in part because the peak population density near the origin coincides with one
extreme of the descriptor space.

Combinatorial fingerprints: Hierarchical clustering of the combinatorial dataset using
2D fingerprints gave 21 clusters as a "natural" level (see discussion above of hierarchical
clustering). These included 19 "pure" clusters made up of 262, 152, 75, 64, 64, 45, 43 or 17
pyridines; 29, 21, 11, 10, 6, 5, 5, 3, or 2 cyclohexanes; and 8 or 2 pyrimidines. Two mixed
clusters contained both pyridines and pyrimidines - 161 and 2, or 9 and 4, respectively. The
relationships between the sublibraries shown by this clustering pattern are illustrated
schematically in Figure 4. Note that the areas of each set shown in Figure 4 indicate the
number of compounds of that class which are in the dataset, not the degree of diversity within
each class. Because they are drawn from homologous combinatorial libraries, the degree of
structural variety found within each class is similar. This is reflected in the similar numbers
of pyridine and cyclohexane clusters (8 and 9); that fewer pyrimidine clusters were identified
simply reflects their scarcity.

Twenty-one compounds were selected in each trial (M = 21), and distributions
across clusters were summed across 10 trials for each block at each of seven values of K; R
was set to 0.15. Again, there was one trial at each subsampling rate for each random
number seed used; the seeds used were different from those used for analyzing the associated
scalar descriptors. In this case, blocks were not replicated. Hence the total number of
selections was:

    (21/trial) x (10 trials/block) x (7 subsampling rates) = 1470

in 70 trials, with a total of 210 selections made at each of 7 values of K.

The values of scaled χ² found with respect to random and uniform distributions are
shown in Table 1. Again, the OptiSim selections move away from being purely representative
and begin to resemble cluster-based selection even at low values of K. Note that under this
high-dimensional, non-Euclidean metric the ...

Again, maximum dissimilarity (K = 1000) returned the most uniformly distributed
selection set. Note, however, that the number of superclusters (3 - one for each of the
constituent combinatorial core structures) is small compared to M (here, 21). The
distribution of compounds chosen using maximum dissimilarity selection can be
uncharacteristically uniform in such a situation.

Mannhold-Rekker dataset: The combinatorial dataset described above is very structured
and artificial. It was created so deliberately in an effort to keep evaluation of the selection
method (OptiSim) separate from considerations of the appropriateness of the particular
metric being used. Nonetheless, it is important to know how well the methodology performs
with more realistic datasets. Mannhold et al. compiled a database of 68 structurally diverse
compounds from six pharmacological classes for evaluating different ways of predicting a
compound's octanol/water partition coefficient P from its structure. The dataset includes 15
class I antiarrhythmics, 13 phenothiazines, 12 class III antiarrhythmics, 11 β-blockers, 9
benzamides and 8 potassium channel openers. OptiSim was used to select six compounds from
the dataset in nine trials with K set to 1, 3, 5, 7, 10 or 68. Again, standard UNITY 2D
fingerprints were used to evaluate the dissimilarity between molecules.

In this case, the uniform reference distribution was based on pharmacological classes.
Because there is relatively little variation in population between classes in this dataset,
however, the analysis applied above to the combinatorial dataset is not very informative,
particularly vis-à-vis random sampling. Instead, χ² (uniform) was calculated for each trial, and
the results averaged across the nine trials for each subsampling rate K. The results obtained
are shown in Figure 5; selection of one example from each pharmacological class was best
reproduced at K = 5-7.

Note that here, where the underlying complexity of the dataset (6) is comparable to
M, intermediate values of K outperform maximum dissimilarity. In fact, OptiSim actually
performs somewhat better with respect to the pharmacological classes in this dataset than does
cluster-based selection based on 2D fingerprints and complete linkage (data not shown), in part
because the variation in structures within classes is uneven.
Parametric measures of diversity and of representativeness: The χ² statistic is a
good measure of similarity for validation work, but it requires a reference distribution and so
is of limited usefulness when no hierarchical classification is available for comparison. If
Optimal Dissimilarity Selection is to be used as a surrogate for hierarchical clustering, more
readily accessible measures will be needed to know when the optimal balance between
representativeness and diversity has been obtained for a given subsample size in any particular
application.

It is convenient to use averages when characterizing large datasets, because the law of
large numbers guarantees that a good estimate of the average can usually be obtained from a
random sample of the entire dataset. If S is the set of M compounds selected and U is the set
of n compounds chosen randomly from among those compounds which were not selected, then
the average dissimilarity δ between each compound in S and the other compounds in S is a
measure of diversity, whereas the average ...
evaluating the dissimilarity between each compound and the reference set is being used, so:

M 



Figure 6A shows a plot of ρ as a function of δ for fingerprints from the combinatorial dataset;
for clarity, only five trials at each of seven levels of subsample size K are shown. Data are
also shown in Figure 6B for both unbiased random selection of one compound from each of
the hierarchical clusters described above (labelled "random" in Figure 6B), and for random
samplings from among the three most central compounds in each cluster (labelled "central"
in Figure 6B; this is a Tanimoto space, so "central" compounds are not necessarily
representative). Results from both complete linkage and group average hierarchical clustering
are shown.
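
A minimal sketch of how a diversity measure δ and a representativeness measure ρ of this
kind might be computed is given below. It rests on an assumption that should be stated
plainly: the dissimilarity between a compound and a reference set is taken here to be the
minimum pairwise dissimilarity to any member of that set, and δ and ρ are simple means of
those minima over S and over U respectively. The function names and the one-dimensional toy
data are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
Dissim = Callable[[T, T], float]


def mean_min_dissimilarity(queries: Sequence[T], reference: Sequence[T],
                           dissim: Dissim, skip_same_index: bool = False) -> float:
    """Average, over the queries, of each query's smallest dissimilarity to the reference set."""
    if not queries:
        raise ValueError("no query compounds supplied")
    total = 0.0
    for qi, query in enumerate(queries):
        distances = [dissim(query, ref) for ri, ref in enumerate(reference)
                     if not (skip_same_index and ri == qi)]
        if not distances:
            raise ValueError("reference set is empty for at least one query")
        total += min(distances)
    return total / len(queries)


def diversity(selected: Sequence[T], dissim: Dissim) -> float:
    """delta: each selected compound against the OTHER selected compounds (assumed form)."""
    return mean_min_dissimilarity(selected, selected, dissim, skip_same_index=True)


def representativeness(unselected_sample: Sequence[T], selected: Sequence[T],
                       dissim: Dissim) -> float:
    """rho: each sampled unselected compound against the selection set (assumed form)."""
    return mean_min_dissimilarity(unselected_sample, selected, dissim)


# Toy data on a line; absolute difference stands in for a fingerprint dissimilarity.
S = [0.0, 1.0]
U = [0.1, 0.5, 0.9]
print(diversity(S, lambda a, b: abs(a - b)))
print(representativeness(U, S, lambda a, b: abs(a - b)))
```

Under this convention a larger δ means a more diverse selection set and a smaller ρ means a
more representative one, matching the sense in which the two symbols are used in the
discussion of Figures 6A and 6B.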

The scatter in ρ and δ among selection sets obtained at the same value of K reflects the
random sampling component of the method. The variance in δ falls sharply as K increases,
with a standard deviation (SD) of 0.043, 0.012 and 0.010 at K = 1, 5 and 35, respectively.
For any particular dataset and a given M, there is a characteristic limiting value for δ. In
this case, that limiting value is slightly less than 0.7.

The variance in ρ, on the other hand, is primarily determined by the randomly chosen
unselected compounds (U) to which the selection sets are compared; it changes only
slightly with increasing K. Note that for OptiSim selection sets, ρ and δ will both be smallest
(on average) when K = 1. Moreover, the expected values for ρ and δ will be equal if and only
if: S and U are the same size; R = 0; and K = 1.

As expected, an increase in diversity (δ) comes at the cost of a decrease in
representativeness (increase in ρ). Also as expected, minimum dissimilarity selection (K = 1)
returns a selection set more representative (lower ρ) but less diverse (lower δ) than does
cluster-based selection. Maximum dissimilarity selection ... some cost in
representativeness, particularly with respect to selection from among the most central
compounds in each cluster. At intermediate
subsampling rates, OptiSim returns selection sets which are both more representative (lower
ρ) and more diverse (higher δ) than are the cluster-based selections. This difference in the
quality of the selections made accounts for the failure of χ² (uniform) to fall much below 2 for
this dataset when hierarchical clustering is taken as the reference distribution.

Recall that the hierarchical clustering used as a reference is based on complete linkage.
Ward's method is an alternative approach which uses a probabilistic, least-squares rationale
to maximize distances between clusters while minimizing distances within clusters under
Euclidean metrics. Evidently, OptiSim's sampling approach imparts similar properties to its
selection sets. This is potentially a quite useful generalization, since Ward's method is not
applicable to metric spaces in which centroids of clusters are not well-defined - in particular,
it is not directly applicable to Tanimoto coefficients based on 2D fingerprints or other bit set
descriptors.

It has been determined by applying the method of this invention through extensive
testing that there is little or no advantage in selecting a value of K greater than the square root
of N. For instance, it can be seen in Figure 5 that the scaled χ² has a minimum (~ 5) for a
K < √68. In addition, in Figure 6A, it can be seen that most of the diversity (δ) obtainable
(as indicated by K = N, 1000) is achieved with a K = 35, approximately the square root of
1000. Also, by the time K is as large as √N, it can be seen that representativeness (ρ) is being
lost (ρ higher value), indicating, perhaps, that in this example K = √N may already be too
high. Thus, the fundamental number of steps for the method of this invention is proportional
to or less than √N.
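
As a small worked example of the square-root guideline, the following Python helper
computes the largest integer K not exceeding √N for the two dataset sizes discussed here.
The function name is hypothetical, and treating the integer square root as an upper bound on
K is simply one way of reading the guideline, not a rule stated in this document.

```python
import math


def max_useful_subsample_size(n: int) -> int:
    """Largest integer K with K * K <= n, used here as a rough upper bound on K."""
    if n < 1:
        raise ValueError("dataset must contain at least one item")
    return max(1, math.isqrt(n))


for n in (68, 1000):
    print(n, max_useful_subsample_size(n))
```

For N = 68 this gives 8, just above the K = 5-7 range found best for the Mannhold-Rekker
dataset, and for N = 1000 it gives 31, close to the K = 35 setting examined for the
combinatorial dataset.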

The ρ vs δ plot such as in Figure 6A is one way to compare different selection sets.
Another way is to apply hierarchical clustering to the selection sets themselves. It is then
possible to compare the dendrograms produced by such analyses to qualitatively assess the
similarities and differences between the selection sets. This is illustrated for single linkage,
group average, and complete linkage hierarchical clustering methods in Figure 8. The 1000
compound mixed dataset was partitioned into 30 clusters for each method, and a representative
compound was drawn from among the three most central compounds in each cluster. Those
representatives were then themselves clustered to give the dendrograms shown. For
the sake of consistency, this secondary clustering was done using complete linkage in each
case. The class of compounds making up each cluster is indicated at the bottom of the
dendrogram, where each vertical line corresponds to the selected compound.
Based on the dendrograms and on the relative number of pyrimidine and pyridine
structures selected, it is clear that the single linkage is the most biased towards unusual
structures (pyrimidines) and so gives the most diverse, but least representative selection.
Indeed, many of the single linkage clusters are pyrimidine singletons, and the 90% of the
dataset comprised of pyridines are lumped together into just seven large clusters. Complete
linkage clustering, on the other hand, is skewed towards a better representation of the more
common class of compounds (pyridines) and is correspondingly less diverse. Group average
clustering provides an intermediate balance in discrimination among and between classes.
Figure 9 shows analogous dendrograms to those of Figure 8 but obtained using the
OptiSim method with K = 1 (minimum dissimilarity), K = 3, and K = 5. At the smallest
subsample size (K = 1), the selection is almost purely representative, with only one pyrimidine
and two cyclohexanes being selected. This reflects their numerical contributions to the dataset
(2% and 9% respectively) reasonably well. As expected from the ρ vs δ plot of Figures 6A
and 6B, the complete linkage selection is mimicked reasonably well by the K = 5 selection
set. This comparison illustrates another aspect of the similarity between the OptiSim
methodology and cluster based selection.
Variations and Extensions Of The Basic OptiSim Methodology:

The heart of the Optimal Dissimilarity Selection approach lies in taking a series of
random subsamples from a dataset, and selecting the best candidate from each subsample,
where "best" is defined by some preference criterion which is a function of those compounds
(or, more generally, elements) which have been selected in previous steps. Several useful
variations on the method disclosed in this document will be immediately apparent to those
skilled in the art. The following variations on the method disclosed in this
document are considered within the scope of the invention.
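
The selection loop just summarized can be sketched in Python as follows. This is a
simplified illustration, not a transcription of the SPL macro in the Appendix: the function
name `optisim_select` and its parameters are hypothetical, "best" is taken to be the
candidate with the largest minimum dissimilarity to the compounds already selected,
candidates within radius R of a selected compound are discarded as redundant, and a
subsample is allowed to fall short of K when the pool runs out before it is refilled from
the recycling bin.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def optisim_select(items: Sequence[T], dissim: Callable[[T, T], float],
                   k: int, radius: float, m: int,
                   seed: Optional[int] = None) -> List[T]:
    """Pick up to m items, choosing the best of a random subsample of size k each step."""
    if k < 1:
        raise ValueError("subsample size k must be at least 1")
    rng = random.Random(seed)
    pool = list(range(len(items)))        # indices not yet examined in this pass
    rng.shuffle(pool)
    if not pool or m <= 0:
        return []
    selected = [pool.pop()]               # random starting compound
    recycle: List[int] = []               # examined but not selected; reused later
    while len(selected) < m and (pool or recycle):
        if not pool:                      # every candidate seen once: start a new pass
            pool, recycle = recycle, []
            rng.shuffle(pool)
        subsample: List[Tuple[int, float]] = []
        while pool and len(subsample) < k:
            cand = pool.pop()
            d_min = min(dissim(items[cand], items[s]) for s in selected)
            if d_min > radius:            # otherwise redundant: dropped for good
                subsample.append((cand, d_min))
        if not subsample:
            continue
        best, _ = max(subsample, key=lambda pair: pair[1])
        selected.append(best)
        recycle.extend(c for c, _ in subsample if c != best)
    return [items[i] for i in selected]


points = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
picked = optisim_select(points, lambda a, b: abs(a - b), k=3, radius=0.5, m=3, seed=0)
print(sorted(picked))
```

With k = 1 the first non-redundant candidate drawn is always accepted, and with k equal to
the dataset size every remaining candidate is compared at each step; the sketch therefore
spans the two special cases of minimum and maximum dissimilarity selection by changing a
single argument.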

Other evaluation criteria: OptiSim has been defined here in terms of maximum
minimum pairwise dissimilarity to those compounds already selected as the criterion by which
the "best" compound is to be selected from each subsample. The highest average dissimilarity
could just as well be used, and the cosine coefficient could be substituted for the Tanimoto
coefficient. In fact, the method is generalizable to any set of selection criteria or decision rules
- best priced or most synthetically accessible compound, for example, or lowest price in the
highest quartile in dissimilarity. The redundancy test against R can also be defined more
broadly.
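
To make the substitution of one coefficient for another concrete, the sketch below computes
Tanimoto and cosine dissimilarities for binary fingerprints. The representation is an
assumption made for illustration: each fingerprint is a Python set holding the indices of
its "on" bits, and dissimilarity is taken as one minus the corresponding similarity
coefficient. The function names are hypothetical.

```python
import math
from typing import AbstractSet


def tanimoto_dissimilarity(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """1 - |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:                      # two empty fingerprints: treat as identical
        return 0.0
    return 1.0 - len(a & b) / union


def cosine_dissimilarity(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """1 - |A & B| / sqrt(|A| * |B|) for two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    if not a or not b:                  # exactly one empty fingerprint: nothing in common
        return 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


fp1 = {1, 2, 3}
fp2 = {2, 3, 4}
print(tanimoto_dissimilarity(fp1, fp2))   # 0.5
print(cosine_dissimilarity(fp1, fp2))
```

Either function can be passed wherever a pairwise dissimilarity is needed, so changing the
coefficient does not require changing the selection procedure itself.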

Similarly, selection for each subsample need not be random, but can be systematically
biased towards compounds in combinatorial libraries which share reagent "parents" with
compounds which have already been selected. An analogous scheme is currently implemented
in DiverseSolutions for descriptors of low dimensionality in a cell-based selection system.
Sampling with replacement: As set out in Figure 1, each OptiSim subsample is drawn
from the candidate pool for consideration, then is set aside until all candidates have been
examined once - i.e., the dataset is sampled without replacement. If samples are drawn with
replacement of those compounds which do not get selected, no recycling bin is required. A
given setting of K will then return a more representative but less diverse subset because
sampling of more sparsely populated (outlier) regions will be avoided. The tradeoff will be that
a particular level of diversity among the compounds selected will only be approached at higher
values of K, which can become computationally expensive.

Replacement of redundant compounds in the subsample: The implementation described
in Figure 1 tests each compound in the subsample for redundancy, and replaces any redundant
compounds in the subsample before selecting the best one in that subsample. An alternative
approach is to apply the redundancy test only to that best compound; if it is redundant - that
is, if it is too similar to those which have already been selected - no selection from that subset
is made. This approach can be made faster "up front" than the version of OptiSim set out in
Figure 1 for some descriptors and similarity measures, but will be correspondingly slower at
later selection steps. In addition, the balance between representativeness and diversity will be
shifted towards making more representative selections, just as subsampling with replacement
will.

Dataset clustering: As demonstrated here, OptiSim selection sets behave very much
like selection sets based on hierarchical clustering. OptiSim selections can be used as centers
(i.e., as leaders) for efficiently clustering large datasets on the basis of a secondary similarity
radius, or by assignment of each compound to the most similar center. Moreover, selected
compounds can themselves be submitted to hierarchical clustering. Under this scenario, the
OptiSim selections will be true centers of their "neighborhoods" for any metric, so their
hierarchy will perforce accurately reflect hierarchical relationships across the entire dataset.
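
The second option mentioned above, assigning each compound to its most similar center,
can be sketched as below. The function name and the toy data are hypothetical; ties are
broken in favour of the earlier center, which is an arbitrary choice made only so the
example is deterministic.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def assign_to_centers(items: Sequence[T], centers: Sequence[T],
                      dissim: Callable[[T, T], float]) -> Dict[int, List[T]]:
    """Group items by the index of the least dissimilar center."""
    if not centers:
        raise ValueError("at least one center is required")
    clusters: Dict[int, List[T]] = {index: [] for index in range(len(centers))}
    for item in items:
        # One dissimilarity evaluation per (item, center) pair.
        nearest = min(range(len(centers)), key=lambda c: dissim(item, centers[c]))
        clusters[nearest].append(item)
    return clusters


centers = [0.0, 5.0, 10.0]
data = [0.0, 0.2, 4.9, 5.3, 9.0]
print(assign_to_centers(data, centers, lambda a, b: abs(a - b)))
```

Each item is compared once with each center, so the work grows with the number of items
times the number of centers rather than with pairwise comparisons among all items.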

This is illustrated schematically in Figure 10 for an OptiSim selection set of 10
compounds, with each compound being assigned to the cluster corresponding to the most
similar member of the selection set. Note that the process scales intrinsically with N x M
rather than scaling with N³, as hierarchical clustering does (N² where RNN can be used). Such
OptiSim clustering can be compared with other clustering methods with
respect to cluster size distributions. Figure 11 shows characteristic cluster size distributions for
single linkage, group average and complete linkage methods. Note that here, "size" refers
simply to membership count, not to how much of fingerprint space each cluster covers.
Pyridine memberships are shown as dark bars, whereas cyclohexanes are indicated by hashed
bars and pyrimidines are indicated by stippled bars. Circles indicate the location and identity
of singleton clusters.

For single linkage (top), most of the pyridines (89% of the entire dataset) constitute a
single large cluster, and only one other cluster of any size is found. The pyrimidines, many
of the cyclohexanes and a few pyridines are spread among singletons and doubletons. The
clusters obtained using group average clustering show a more useful level of resolution among
the pyridines, but still discriminate too effectively among the pyrimidines. Among the
methods illustrated in Figure 11, the most intuitively appropriate balance between
representativeness (in this case, discriminating among pyridines) and diversity (discriminating
among pyrimidines) seems to be struck by complete linkage clustering (bottom).
Figures 12 and 13 compare the cluster size distributions obtained for OptiSim
clusterings based on K = 1 and 2 or 3 and 5, respectively. At the smaller subsample size, the
clustering is representative in that the pyridines are well distributed across clusters; the
uncommon compounds - here, cyclohexanes and pyrimidines - fall into just one or two
clusters each. Indeed, in some cases cyclohexanes get lumped in with pyridines. As the
subsample size grows, discrimination among the more distinctive compound classes increases,
until at K = 5 (middle) the distribution is quite similar to that seen for complete linkage
clustering (bottom).

There is one respect in which the OptiSim clustering for K = 5 differs appreciably
from that for complete linkage. Note that the cyclohexane clusters are considerably more ...
in size in the latter case. This reflects the fact that the OptiSim clusters can adjust
their diameters (maximum pairwise dissimilarity within the cluster) somewhat to compensate
for "local" variations in scale and/or spatial distribution. This flexibility is an advantage over
hierarchical methods in general, and complete linkage in particular. It arises because the
OptiSim selection method is inherently progressive: the first selection by definition represents
the whole dataset, and subsequent selections fill in details of areas not yet adequately
represented. Indeed, such progressivity is one of the desirable hallmarks of divisive clustering
methods, which are not themselves well suited for application to ...

Examples of Broad Applicability of OptiSim Methodology:

Clinical Trial Design:

At present, populations for clinical trials are generally chosen to be as homogeneous
as possible. This is done to maximize the sensitivity of the tests to treatment effects. Due to
the recognition that subpopulations may exist, there is growing pressure to move away from
this paradigm, however, and run clinical trials in more heterogeneous populations. There are
several scenarios under which this becomes beneficial:

1) A significant subpopulation exists which reacts badly to the treatment being
   studied. Heritable allergies and drug interactions fall into this category.

2) The intended benefit is not seen in the general population, but a significant
   subpopulation exists for which the intended benefit is realized and is of
   substantial value.

3) There is an unanticipated benefit to the treatment, but one which is only
   realized in a subpopulation. Hair growth stimulators and impotence relief
   treatments fall into this category.

4) There is significant systematic variation in response to the treatment, either due
   to genetics (P450 or other metabolism) or because of interactions with other
   drugs.

All of these subpopulation effects could potentially be detected by doing clinical trials
on more varied test populations. Since the relevant subpopulation will not, in general, be
known ahead of time, the best approach is to create a representative diverse sample based on
a large number of personal characteristics - genetic, physical, lifestyle, etc. OptiSim is ideally
suited for defining an appropriate test group from such information since it allows intelligent
choices to be made between the representativeness and diversity of the subsets of test
populations to be chosen.

Internet Searching:

The Internet represents an immense body of information which is expanding at a
phenomenal rate. A major problem presented by so much readily accessible information is the
ability to find the item(s) of interest. OptiSim may be applied to Internet searching to simplify
the task of identifying relevant information. At present, such searches, of necessity, operate
off a hierarchy of keywords and indices. Besides the sheer volume of information which needs
to be searched, this approach is susceptible to various abuses ... As a result, users can have
difficulty finding the information they seek. An alternative methodology enabled by OptiSim
would be to collect all addresses which might be relevant to the query, then use OptiSim to
select diverse representatives and cluster the rest for perusal as appropriate. The user could by
experience determine his/her personal value of K which most frequently returned information
of most interest.

OptiSim is a generalized dissimilarity selection methodology which includes the
established methods of maximum and minimum dissimilarity selection as special cases. By
varying the subsample size K, one can adjust the balance between how representative the
selected subset is and how diverse it is. Intermediate settings can mimic the results obtained
for selection based on hierarchical clustering or, in some cases at least, improve upon them.
Different embodiments of the invention disclosed in this document will consist of different
datasets, different subset sizes, different dissimilarity measures, and different choices for the
value of K. All are considered within the scope of the teaching of the OptiSim invention. In
addition, those skilled in the art will recognize that various modifications, additions,
substitutions, and variations to the illustrative examples set forth herein can be made without
departing from the spirit of the inventions and are, therefore, considered within the scope of
the invention.
Appendix "A"

@macro TILE sybyltable
##########################################################
globalvar TABLE!TILE!NO_RAND_K TANIMOTO

localvar rowz ro_count ro_count0 ro_set keep_ct
localvar c colz colf kept type overflow
localvar tempz tempf K k0 R R2 i j keepers oncers
localvar nada d df selz d_to_ks min_d max_d q trials

if %not( %table_default() )
   echo TABLE TILE requires an open table.
   return
endif

setvar rowz $1
setvar ro_count %table( "$rowz" row count )
while %lteq( $ro_count 0 )
   setvar rowz %prompt( ROW_EXP "Rows to use" )
   if $status
      return
   endif
   setvar ro_count %table( "$rowz" row count )
endwhile

setvar colz %table( "$2" COL NUMBER )
while %not( $colz )
   setvar colz %prompt( COL_EXP "Columns to Use" )
   if $status
      return
   endif
   setvar colz %table( "$colz" COL NUMBER )
endwhile

# Separate fingerprint columns (tempf) from scalar columns (tempz)
setvar tempz
setvar tempf
for c in $colz
   setvar type %table( $c COLUMN DATA_TYPE )
   if %set_member( "$type" INT,DOUBLE,FLOAT )
      switch %table( $c COLUMN COLUMN_TYPE )
         case FINGERPRINT)
         case ATOM_PAIR_FP)
            setvar tempf $tempf $c
         ;;
         case )
            setvar tempz $tempz $c
         ;;
      endswitch
   else
      echo "Sorry; %table( $c COL NAME ) is not suitable for this analysis."
   endif
endfor

if $tempz $tempf
   setvar colz $tempz
   setvar colf $tempf
else
   echo No valid columns specified.
   return
endif
setvar K $3
while %not( %syb_int( "$k" ) )
   setvar K %prompt( NATURAL 0 "Horizon sample size" \
      "Size of sample to consider at each step; 0 tests all." )
   if $status
      return
   endif
endwhile

setvar STATUS
setvar R %promptif( "$4" POSITIVE "Cluster Radius" \
   "Positive number specifying maximum allowed radius for bins." )
if $status
   return
endif
setvar R2 %sqr( $r )

setvar M %promptif( "$5" POSITIVE "" "Max number of centers to identify" \
   "Enter 0 to get all possible." )
if $status
   return
endif

# setvar ro_set %set_create( $rowz )

##
# Select rows then drop if there is missing data
##
command TABLE DELAY YES

# command TABLE SELECT ROW $ro_set
command TABLE SELECT ROW $rowz

for c in $colz
   setvar nada %cat( '{missing(' $c ')}' )
   command TABLE DESELECT ROW $nada
endfor

setvar rowz %table( {selected()} ROW NUMBER )
if %not( $rowz )
   echo Sorry; no specified rows have all required data.
   return
endif

##
# Begin partitioning
#
ON INTERRUPT return

# Prime the pump
#
setvar ro_count %table( {selected()} ROW COUNT )
if $TABLE!TILE!NO_RAND_K
   setvar q 1
   setvar keepers %arg( $q $rowz )
else
   setvar q %math( ( %irand( $6 ) % $ro_count ) + 1 )
   setvar keepers %arg( $q $rowz )
   setvar nada %arg( $ro_count $rowz )
   setvar rowz %setarg( $q "$nada" $rowz )
endif
setvar keep_ct 1
setvar ro_count0 $ro_count
setvar ro_count %math( $ro_count - 1 )
setvar oncers
setvar overflow

if %eq( $k 0 )
#  setvar k $ro_count
   if $TABLE!TILE!NO_RAND_K
      setvar k0 $TABLE!TILE!NO_RAND_K
   else
      setvar k0 $ro_count
   endif
else
   setvar k0 $k
endif

if %eq( $m 0 )
   setvar m $ro_count
endif
#XXXX
if $tanimoto
   setvar tanimoto %sqrt( %math( 1 - $r2 ) )
endif

WHILE %and( "%gt( $ro_count 0 )" "%gt( $M $keep_ct )" )
   setvar selz
   setvar min_ds
   WHILE %and( "%lt( %count( $selz ) $k0 )" "%gt( $ro_count 0 )" )
      if %gt( $k 0 )
         setvar q %math( ( %irand() % $ro_count ) + 1 )
         setvar i %arg( $q $rowz )
         setvar nada %arg( $ro_count $rowz )
         setvar rowz %setarg( $q "$nada" $rowz )
      else
#        setvar i %arg( $ro_count $rowz )
         setvar i %arg( %math( $ro_count0 - $ro_count + 1 ) $rowz )
      endif
      setvar ro_count %math( $ro_count - 1 )

      setvar try_these $keepers
      if $tanimoto
         setvar cardj %rcell( $i $colf value )
         setvar try_these
         for j in $cards[0]
            setvar minT %min( %math( $cardj / $j ) %math( $j / $cardj ) )
            if %gt( %min( $maxT $minT ) $temp_max )
               setvar try_these $try_these $cards[ $j ]
            endif
         endfor
      endif

      # Collect squared distances to the centers kept so far;
      # stop at the first center lying within the radius.
      setvar d_to_ks
      setvar j
      for j in $try_these
         setvar d %table_tile_distance( $i $j "$colz" "$colf" )
         if %gt( $d $r2 )
            setvar d_to_ks $d_to_ks $d
         else
            break
         endif
      endfor

      if %not( $j )
         setvar selz $selz $i
         if $d_to_ks
            setvar min_ds $min_ds %min( $d_to_ks )
         endif
      endif

      if %and( "%lteq( $ro_count 0 )" "%gt( $k 0 )" )
         if $oncers
            setvar rowz $oncers
            setvar ro_count %count( $rowz )
            setvar oncers
         endif
      endif
   ENDWHILE

   if $selz
      setvar max_d %max( $min_ds )
      setvar q %array_pos( "$max_d" $min_ds )
      setvar kept %arg( "$q" $selz )
      setvar keepers $keepers $kept
      setvar keep_ct %math( $keep_ct + 1 )

      setvar selz %item_remove( $q $selz )
      setvar selz %setarg( $q $selz )
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setvar oncers Soncers $selz 



        if %lteq( $ro_count 0 )
            setvar rowz $oncers
            setvar ro_count %count( $rowz )
            setvar oncers
        endif

        if $overflow
            echo -n "$kept is center no. %count( $keepers ) ... " \r
        else
            if %lt( %strlen( "$keepers" ) 100 )
                echo -n "Found %count( $keepers ) centers: " $keepers \r
            else
                echo
                setvar overflow TRUE
                echo -n "$kept is center no. %count( $keepers ) ... " \r
            endif
        endif
    endif



ENDWHILE 

if $keepers
    echo "Found %count( $keepers ) centers: " $keepers
    if $TABLE!TILE!NO_RAND_K
        echo "Accessed all but last $ro_count rows."
    endif
else
    echo No centers found!
endif

command TABLE DELAY NO 

command TABLE SELECT ROW %set_create( $keepers )



############################################################
## Returns distance squared for use in tiling macro.      ##
## NB: $3 and $4 are lists, not sets, of column ids       ##
##                                                        ##
## Bob Clark 10/9/96                                      ##
############################################################

@expression_generator table_tile_distance 

if %not( %gt( $# 2 ) )
    echo 'Usage: %table_tile_distance( row_id "scalar_cols" "finger_cols" )'
    return
endif

localvar c d df colf colz i j

setvar i $1
setvar j $2
setvar colz $3
setvar colf $4

setvar d 0




# Sum of squared differences over the scalar columns
for c in $colz
    setvar d %math( $d + %sqr( %math( %rcell( $i $c VALUE ) - %rcell( $j $c VALUE ) ) ) )
endfor

##
# Check fingerprint columns
##

for c in $colf
    switch $TAILOR!SELECTOR!METRIC
    case TANIMOTO)
        setvar df %tanimoto( $c $j $i )
        ;;
    case COSINE)
        setvar df %cosineco( $c $j $i )
        ;;
    case EUCLIDEAN)
        setvar df %fing_euclid( $c $j $i )
        ;;
    endswitch

    setvar d %math( $d + %sqr( %math( 1 - $df ) ) )
endfor

%return( $d )
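
For readers unfamiliar with SPL, the distance expression above can be paraphrased in Python. This is only a sketch under stated assumptions, not the macro itself: rows are modelled as dictionaries, fingerprints as sets of on-bit indices, and the names `tile_distance_sq` and `tanimoto` are illustrative and do not appear in the listing. As in the macro, each scalar column contributes a squared difference and each fingerprint column contributes the square of (1 - similarity).

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Two empty fingerprints are treated as identical (an assumption).
    return len(a & b) / union if union else 1.0


def tile_distance_sq(row_i, row_j, scalar_cols, finger_cols, similarity=tanimoto):
    """Squared distance between two rows, mirroring the structure of the macro."""
    d = 0.0
    # Scalar columns: sum of squared differences.
    for c in scalar_cols:
        d += (row_i[c] - row_j[c]) ** 2
    # Fingerprint columns: squared dissimilarity, (1 - similarity)^2.
    for c in finger_cols:
        d += (1.0 - similarity(row_i[c], row_j[c])) ** 2
    return d


row_a = {"logp": 1.0, "fp": frozenset({1, 2, 3})}
row_b = {"logp": 3.0, "fp": frozenset({2, 3, 4})}
print(tile_distance_sq(row_a, row_b, ["logp"], ["fp"]))
```

The value returned is a squared distance, so a caller wanting a distance takes the square root, as the subset similarity expression later in the listing does.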



@expression_generator subset_sim

#

localvar c colz d colz i j
localvar fingz
localvar min_ds min_this nada oops




localvar q query_ct query_set
localvar ref_ct ref_set
localvar scalz tanz stats

setvar oops

if %not( %eq( $# 3 ) )
    setvar oops TRUE
else
    setvar colz %table( "$1" col number )
    setvar ref_set %table( "$2" row number )
    setvar ref_ct %count( $ref_set )
    setvar query_set $3
    setvar query_ct %table( "$3" row count )
endif

if %not( $ref_set )
    setvar oops TRUE
endif

if %lt( $query_ct 1 )
    setvar oops TRUE
endif

if %neq( %count( $colz ) 1 )
    setvar oops TRUE
endif

#

if $oops
    echo 'Usage: %subset_sim( ref_rows query_rows metric_column )'
    return
endif

# Drop query rows with missing data in the metric column
for c in $colz
    setvar nada %cat( '{missingC $c )
    setvar foo %mss_set_create( %table( "$nada" row number ) )
    if $foo
        setvar query_set %mss_set_diff( $query_set $nada )
    endif
endfor

if %not( $query_set )
    ERROR Too many rows have missing data
    return
endif

setvar scalz
setvar fingz

for q in $colz
    setvar nada %table( $colz col data )
    if %not( %set_member( "$nada" DOUBLE,INT,FLOAT ) )
        echo 'Metric column' %table( $q col name ) 'is inappropriate; skipping.'
        continue
    endif

    switch %table( $q col column_type )
    case FINGERPRINT)
    case ATOM_PAIR_FP)
        setvar fingz $fingz $q
        ;;
    case )
        setvar scalz $scalz $q






endswitch 
endfor 



setvar min_ds

for i in $ref_set
    setvar nada %cat( "$query_set" 'y $i )
    setvar q %table( $nada row number )

    if %not( $q )
        echo 'No rows to compare to' %table( $i row name ) '; skipping...'
        continue
    endif



    # Track the smallest squared distance from row $i to any query row
    setvar min_this
    for j in $q
        setvar d %table_tile_distance( $i $j "$scalz" "$fingz" )
        if %not( $d )
            echo No distance returned for $i $j
            continue
        endif

        if %not( $min_this )
            setvar min_this $d
        endif

        if %lt( $d $min_this )
            setvar min_this $d
        endif
    endfor

    if $min_this
        setvar min_ds $min_ds %sqrt( $min_this )
    endif
endfor

setvar stats %stats( $min_ds )
%return( "$stats" )
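
The nearest-neighbour summary computed by this expression can be sketched in Python as follows. The sketch rests on assumptions that the listing does not confirm: rows are passed as dictionaries keyed by row id, a row is never compared with itself, and the summary is reduced to a mean, a median and a sample standard deviation (the exact layout of the `%stats` result is not shown in the listing). The name `subset_sim` is reused purely for readability.

```python
import math
import statistics


def subset_sim(refs, queries, dist_sq):
    """Summarise, over all reference rows, the distance to the nearest query row.

    refs and queries map row ids to row data; dist_sq returns a squared distance.
    """
    nearest = []
    for i, ref in refs.items():
        # Assumption: a row is never compared with itself.
        candidates = [row for j, row in queries.items() if j != i]
        if not candidates:
            continue  # nothing to compare this reference row to
        nearest.append(math.sqrt(min(dist_sq(ref, row) for row in candidates)))
    if not nearest:
        return None
    sd = statistics.stdev(nearest) if len(nearest) > 1 else 0.0
    return {
        "mean": statistics.mean(nearest),
        "median": statistics.median(nearest),
        "sd": sd,
    }


def squared_difference(a, b):
    return (a - b) ** 2


between = subset_sim({1: 0.0, 2: 10.0}, {3: 1.0, 4: 7.0}, squared_difference)
print(between)
```

Calling the function with the same rows as both arguments gives the within-set figure, which is how the macro that follows obtains its second line of output.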



@macro SUBSET_SIMILARITY CHOM

localvar colz nada ques refs stats
localvar reps divers tabx

setvar GLOBAL!STATUS

ON INTERRUPT goto flee_chom_subset_similarity

setvar tabx %table_default()

if $tabx
    echo
    echo working table is $tabx
else
    echo No table is currently active
    return
endif

setvar colz %promptif( "$1" COL_EXP "{selected()}" \
    "Metric column from which to calculate (dis)similarities." )
if $global!STATUS
    return
endif

if %lt( %table( "$colz" col count ) 1 )
    echo Sorry; expression $colz evaluates to no columns.
    return
endif

for nada in %table( $colz col data )
    if %not( %set_member( $nada DOUBLE,INT,FLOAT ) )
        echo 'Metric column type is inappropriate for' %table( $nada col name )
        return
    endif
endfor

setvar refs %promptif( "$2" ROW_EXP "{selected()}" \
    "Rows to use as reference points" )
if $global!STATUS
    return
endif

if %lt( %table( "$refs" row count ) 1 )
    echo Sorry; expression $refs evaluates to no rows.
    return
endif

setvar ques %promptif( "$3" ROW_EXP \
    "Rows for which you wish to calculate (dis)similarities." )
if $global!STATUS
    return
endif

if %lt( %table( "$ques" row count ) 1 )
    echo Sorry; expression $ques evaluates to no rows.
    return
endif

setvar reps %subset_sim( $colz $ques $refs )
if %not( $reps )
    echo Evaluation of intersubset similarity failed
    return
endif

setvar divers %subset_sim( $colz $refs $refs )
if %not( $divers )
    echo Evaluation of intrasubset similarity failed
    return
endif

echo
echo "                   MEAN     MEDIAN         SD "
echo
echo "BETWEEN SETS: %printf( "%10.4f %10.4f %10.4f" %arg( 1 $reps ) %arg( 2 $reps ) %arg( 4 $reps ) )"
echo "  WITHIN SET: %printf( "%10.4f %10.4f %10.4f" %arg( 1 $divers ) %arg( 2 $divers ) %arg( 4 $divers ) )"
echo

flee_chom_subset_similarity:

CLAIMS

What is claimed is: 

1. A computer implemented method, using predetermined characteristics representative of
the members of a population, of selecting a member group, of predetermined size M,
from the population comprising the following steps:

a. randomly select a member from the population, add it to the group, create a
pool of candidate members out of the remainder of the population, and create
an empty recycling bin;

b. randomly choose a member from the candidate pool and determine if it has a
dissimilarity less than R with respect to those members already selected for the
group, and, if it does not have a dissimilarity less than R, place it in a
subpopulation;

c. if the member selected in step b has a dissimilarity less than R to those
members already selected for the group, discard that member;

d. repeat steps b and c until the subpopulation includes K members at least R
dissimilar to those already selected for the group or until the candidate pool is
empty;

e. if the candidate pool is empty, remove all members from the recycling bin and
put them into the candidate pool;

f. if the candidate pool is not empty, and, if there are fewer than K members in
the subpopulation, go to step b;

g. if the subpopulation is empty, terminate the selection process;

h. if the subpopulation is not empty, examine the subpopulation and identify a
member maximally dissimilar to those members already selected for the group;

i. add the member identified in step h to the group and remove it from the
subpopulation;

j. put those members in the subpopulation, which were not added to the group,
into the recycling bin; and

k. if the desired selected group size M has been reached, terminate the selection
process;

l. if the desired selected group size M has not been reached, return to step b

wherein the value of K is selected to determine the representativeness or diversity desired in
the subset and the value of R is selected to determine the minimum dissimilarity between
members.
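
Steps a through l of claim 1 can be sketched in Python as follows. This is an illustrative reading of the claim, not the implementation disclosed in the macro listing: the function name `optisim_select`, the `dist` callback and the `rng` argument are assumptions introduced for the example, and "maximally dissimilar" is read here as having the largest minimum dissimilarity to the members already in the group.

```python
import random


def optisim_select(population, dist, M, K, R, rng=None):
    """Select up to M members, each at least R dissimilar to all others chosen."""
    rng = rng or random.Random()
    pool = list(population)
    rng.shuffle(pool)              # popping from a shuffled list is a random draw
    group = [pool.pop()]           # step a: random first member
    recycling = []                 # step a: empty recycling bin
    while len(group) < M:          # steps k and l
        subsample = []             # pairs of (candidate, min dissimilarity to group)
        while len(subsample) < K:  # steps d and f
            if not pool:           # step e: refill the pool from the recycling bin
                if not recycling:
                    break
                pool, recycling = recycling, []
                rng.shuffle(pool)
            candidate = pool.pop()                       # step b
            d_min = min(dist(candidate, g) for g in group)
            if d_min >= R:
                subsample.append((candidate, d_min))
            # step c: a candidate closer than R is discarded
        if not subsample:          # step g
            break
        best = max(range(len(subsample)), key=lambda n: subsample[n][1])  # step h
        group.append(subsample.pop(best)[0])             # step i
        recycling.extend(c for c, _ in subsample)        # step j
    return group


picked = optisim_select(range(100), lambda a, b: abs(a - b), M=5, K=3, R=10,
                        rng=random.Random(0))
print(picked)
```

In this sketch a candidate that passes the R test is never lost: it is either added to the group or returned to the recycling bin, so it remains available in later rounds.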

2. The method of claim 1 in which at step j, the members are not put into the recycling
bin, but rather are immediately put back into the candidate pool.

3. A computer implemented method, using predetermined characteristics representative
of the members of a population, of selecting a member group, of predetermined size
M, from the population comprising the following steps:

a. randomly select a member from the population and add it to the group;

b. for each additional member to be included in the group, randomly create a non-
redundant subpopulation from the whole population;

c. select, using a preference criterion, the member from the subpopulation which
best satisfies the preference criterion with respect to those members already
selected for inclusion in the group;

d. repeat steps a and b until M members are selected or until a non-redundant
subpopulation can not be obtained.

4. The method of claim 3 in which the subpopulation size is varied to adjust the
representativeness or diversity of the group.
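
The effect of the subpopulation size described in claims 3 and 4 can be illustrated with a short Python sketch. It is a simplified reading under stated assumptions: "non-redundant" is interpreted as sampling without replacement from the members not yet selected, and the names `select_with_preference`, `score` and `farthest_first` are hypothetical. The preference criterion is passed in as a function, so maximum dissimilarity is only one possible choice.

```python
import random


def select_with_preference(population, score, M, K, rng=None):
    """Pick M members; each pick is the best-scoring member of a random subsample."""
    rng = rng or random.Random()
    remaining = list(population)
    rng.shuffle(remaining)
    group = [remaining.pop()]                  # step a: random first member
    while len(group) < M and remaining:
        k = min(K, len(remaining))
        subsample = rng.sample(remaining, k)   # step b: sample without replacement
        best = max(subsample, key=lambda c: score(c, group))  # step c
        group.append(best)
        remaining.remove(best)
    return group


def farthest_first(candidate, group):
    """Example preference criterion: minimum dissimilarity to the group."""
    return min(abs(candidate - g) for g in group)


print(select_with_preference(range(100), farthest_first, M=3, K=100,
                             rng=random.Random(1)))
```

With K equal to 1 each pick is a purely random draw, which favours representativeness; with K equal to the population size every pick is the most dissimilar remaining member, which favours diversity. Intermediate values of K trade one against the other.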

5. A computer implemented method, using predetermined characteristics representative
of the members of a population, of selecting a member group, of predetermined size
M, from the population comprising drawing the members for testing for inclusion in
the group from a subpopulation the size of which is varied to balance the
representativeness and diversity of the members selected.

6. The method of claim 5 in which the subpopulation is randomly selected and non-
redundant.

7. The method of claim 5 in which the testing of the members for inclusion in the group
comprises satisfying a preference criterion.

8. The method of claim 5 utilizing subpopulations of size K, where K is not equal to 1,
in which the number of iterations necessary to select the group is less than or equal to
the population size and typically equals the square root of the population size.

9. A computer implemented method, using predetermined characteristics representative
of the members of a population, of selecting a member group, of predetermined size
M, from the population comprising utilizing subpopulations of size K, where K is not
equal to 1, in which the number of iterations necessary to select the subset is less than
or equal to the population size and typically equals the square root of the population
size.

10. The group of members selected, using predetermined characteristics representative of
the members of a population, from a larger population by the method comprising the
following steps:

a. randomly select a member from the population, add it to the group, create a
pool of candidate members out of the remainder of the population, and create
an empty recycling bin;

b. randomly choose a member from the candidate pool and determine if it has a
dissimilarity less than R with respect to those members already selected for the
group, and, if it does not have a dissimilarity less than R, place it in a
subpopulation;

c. if the member selected in step b has a dissimilarity less than R to those
members already selected for the group, discard that member;

d. repeat steps b and c until the subpopulation includes K members at least R
dissimilar to those already selected for the group or until the candidate pool is
empty;

e. if the candidate pool is empty, remove all members from the recycling bin and
put them into the candidate pool;

f. if the candidate pool is not empty, and, if there are fewer than K members in
the subpopulation, go to step b;

g. if the subpopulation is empty, terminate the selection process;

h. if the subpopulation is not empty, examine the subpopulation and identify a
member maximally dissimilar to those members already selected for the group;

i. add the member identified in step h to the group and remove it from the
subpopulation;

j. put those members in the subpopulation, which were not added to the group,
into the recycling bin; and

k. if the desired selected group size M has been reached, terminate the selection
process;

l. if the desired selected group size M has not been reached, return to step b

wherein the value of K is selected to determine the representativeness or diversity desired in
the subset and the value of R is selected to determine the minimum dissimilarity between
members.

11. The group of members selected, using predetermined characteristics representative of
the members of a population, from a larger population by the method comprising the
following steps:

a. randomly select a member from the population and add it to the group;

b. for each additional member to be included in the group, randomly create a non-
redundant subpopulation from the whole population;

c. select, using a preference criterion, the member from the subpopulation which
best satisfies the preference criterion with respect to those members already
selected for inclusion in the group;

d. repeat steps a and b until M members are selected or until a non-redundant
subpopulation can not be obtained.

12. The group of members of claim 11 in which the subpopulation size is varied to adjust
the representativeness or diversity of the group.

13. The group of members selected, using predetermined characteristics representative of
the members of a population, from a larger population by the method comprising
drawing the members for testing for inclusion in the group from a subpopulation the
size of which is varied to balance the representativeness and diversity of the members
selected.

14. The group of members selected, using predetermined characteristics representative of
the members of a population, from a larger population by the method comprising
utilizing subpopulations of size K, where K is not equal to 1, in which the number of
iterations necessary to select the subset is less than or equal to N and typically equals
the square root of N.

15. A computer implemented method for clustering members of a large population, using
predetermined characteristics representative of the members of a population, in which
the center of the clusters to be formed are determined by the groups selected from the
population by the method comprising the following steps:

a. randomly select a member from the population, add it to the group, create a
pool of candidate members out of the remainder of the population, and create
an empty recycling bin;

b. randomly choose a member from the candidate pool and determine if it has a
dissimilarity less than R with respect to those members already selected for the
group, and, if it does not have a dissimilarity less than R, place it in a
subpopulation;

c. if the member selected in step b has a dissimilarity less than R to those
members already selected for the group, discard that member;

d. repeat steps b and c until the subpopulation includes K members at least R
dissimilar to those already selected for the group or until the candidate pool is
empty;

e. if the candidate pool is empty, remove all members from the recycling bin and
put them into the candidate pool;

f. if the candidate pool is not empty, and, if there are fewer than K members in
the subpopulation, go to step b;

g. if the subpopulation is empty, terminate the selection process;

h. if the subpopulation is not empty, examine the subpopulation and identify a
member maximally dissimilar to those members already selected for the group;

i. add the member identified in step h to the group and remove it from the
subpopulation;

j. put those members in the subpopulation, which were not added to the group,
into the recycling bin; and

k. if the desired selected group size M has been reached, terminate the selection
process;

l. if the desired selected group size M has not been reached, return to step b

wherein the value of K is selected to determine the representativeness or diversity desired in
the subset and the value of R is selected to determine the minimum dissimilarity between
members.

16. A computer implemented method for clustering members of a large population, using
predetermined characteristics representative of the members of a population, in which
the center of the clusters to be formed are determined by the groups selected from the
population by the method comprising the following steps:

a. randomly select a member from the population and add it to the group;

b. for each additional member to be included in the group, randomly create a non-
redundant subpopulation from the whole population;

c. select, using a preference criterion, the member from the subpopulation which
best satisfies the preference criterion with respect to those members already
selected for inclusion in the group;

d. repeat steps a and b until M members are selected or until a non-redundant
subpopulation can not be obtained.
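
Claims 15 and 16 state only that the selected group supplies the cluster centers. One plausible way to form clusters from those centers, shown here purely as an assumption, is to assign every member of the population to its nearest center. The function name `cluster_by_centers` and the tie-breaking rule (the earliest selected center wins) are illustrative and are not taken from the claims.

```python
def cluster_by_centers(population, centers, dist):
    """Assign each member to the nearest center; ties go to the earliest center."""
    clusters = {n: [] for n in range(len(centers))}
    for member in population:
        nearest = min(range(len(centers)), key=lambda n: dist(member, centers[n]))
        clusters[nearest].append(member)
    return clusters


clusters = cluster_by_centers(range(10), [2, 8], lambda a, b: abs(a - b))
print(clusters)  # {0: [0, 1, 2, 3, 4, 5], 1: [6, 7, 8, 9]}
```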

17. The method of claim 1 in which the predetermined characteristic representative of the
members of the population is chemical structure.

18. The method of claim 3 in which the predetermined characteristic representative of the
members of the population is chemical structure.

19. The method of claim 5 in which the predetermined characteristic representative of the
members of the population is chemical structure.

20. The method of claim 9 in which the predetermined characteristic representative of the
members of the population is chemical structure.

21. The method of claim 15 in which the predetermined characteristic representative of the
members of the population is chemical structure.

22. The method of claim 16 in which the predetermined characteristic representative of the
members of the population is chemical structure.

[Figure: flowchart of the selection method. Recoverable box labels, in the order extracted:]

CREATE A CANDIDATE POOL FROM ALL COMPOUNDS IN THE DATASET OF INTEREST.
PICK A COMPOUND C AT RANDOM FROM THE CANDIDATE POOL.
REMOVE C FROM THE CANDIDATE POOL AND PUT IT INTO THE SELECTION SET.
REMOVE ANOTHER COMPOUND C FROM THE CANDIDATE POOL AT RANDOM.
TRANSFER THE CONTENTS OF THE RECYCLING BIN TO THE CANDIDATE POOL.
DONE
ADD C TO THE SUBSAMPLE.
FIND THE BEST COMPOUND C IN THE SUBSAMPLE, e.g. THAT WHICH IS FARTHEST FROM (LEAST SIMILAR TO) ANYTHING IN THE SELECTION SET.
TRANSFER C FROM THE SUBSAMPLE TO THE SELECTION SET.
TAKE ALL OTHER COMPOUNDS OUT OF THE SUBSAMPLE AND PUT THEM INTO THE RECYCLING BIN.
DONE